Generate AI-Ready llms.txt Files from Screaming Frog Website Crawls
Workflow Overview
This 23-node n8n workflow generates an llms.txt file from a Screaming Frog website crawl. A form collects the website name, a short description, and the internal_html.csv export; the workflow then filters the crawled URLs and assembles them into a downloadable llms.txt file.
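The output follows the format the workflow's nodes assemble: an H1 with the site name, a blockquote with the description, then one markdown link per page, with the meta description appended when present. A short illustrative sample (the site and URLs are hypothetical):

```markdown
# The Best Website Ever
> This is the best website ever because all the content is engaging and valuable.
- [Getting Started](https://example.com/getting-started): A quick introduction to the product.
- [Pricing](https://example.com/pricing): Plans and feature comparison.
- [Blog](https://example.com/blog)
```

Note that the last row has no `: description` suffix; the workflow omits it when a page has no meta description.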
Workflow Source Code
{
"id": "",
"meta": {
"instanceId": "",
"templateCredsSetupCompleted": true
},
"name": "Generate AI-Ready llms.txt Files from Screaming Frog Website Crawls",
"tags": [],
"nodes": [
{
"id": "ca701618-b2d5-48ee-a503-d3513d018a65",
"name": "Sticky Note",
"type": "n8n-nodes-base.stickyNote",
"position": [
360,
-500
],
"parameters": {
"color": 7,
"width": 360,
"height": 860,
"content": "## Form - Screaming Frog internal_html.csv upload
This form node is used to trigger the workflow.
It contains **three input fields**:
- Name of the website
- Short description of the website
- **Screaming Frog** export containing the internal URLs
It is recommended to use the **internal_html.csv** export, but **internal_all.csv** will also work, as the workflow includes a filter to process only indexable URLs.
"
},
"typeVersion": 1
},
{
"id": "bc040ca0-f38d-4458-a60c-17f71dbfd1ea",
"name": "Sticky Note1",
"type": "n8n-nodes-base.stickyNote",
"position": [
780,
-500
],
"parameters": {
"color": 7,
"width": 360,
"height": 860,
"content": "## Extract data from Screaming Frog file
This node extracts data from the **CSV file** provided by the user.
It produces an output that is **easily usable** in the following nodes.
⚠️ **Caution:**
If the uploaded file is **not** the expected Screaming Frog export, the workflow will still proceed but will likely **fail in the next steps** due to missing required fields.
"
},
"typeVersion": 1
},
{
"id": "f71a7d10-847d-48e7-8820-ec0c1e7ea055",
"name": "Sticky Note2",
"type": "n8n-nodes-base.stickyNote",
"position": [
1200,
-500
],
"parameters": {
"color": 7,
"width": 360,
"height": 860,
"content": "## Set Useful Fields
This node sets **7 key fields** from the Screaming Frog export:
- `url` → from the **\"Address\"** column
- `title` → from the **\"Title 1\"** column
- `description` → from the **\"Meta Description 1\"** column
- `status` → from the **\"Status Code\"** column
- `indexability` → from the **\"Indexability\"** column
- `content_type` → from the **\"Content Type\"** column
- `word_count` → from the **\"Word Count\"** column
**Multi-language compatibility**
If you're using Screaming Frog in **French, Italian, German, or Spanish**, the column names will be different.
However, the workflow is designed to handle this, so it will **still work correctly**! 🥳
"
},
"typeVersion": 1
},
{
"id": "6f6546b8-adeb-4998-ae19-d93525337eb7",
"name": "Set useful fields",
"type": "n8n-nodes-base.set",
"position": [
1340,
60
],
"parameters": {
"options": {},
"assignments": {
"assignments": [
{
"id": "0e7d4a06-83fc-4834-93fe-2e758cbe2307",
"name": "url",
"type": "string",
"value": "={{ $json.Address || $json.Adresse || $json.Dirección || $json.Indirizzo }}"
},
{
"id": "c82f4d4c-9d0b-4c7d-9647-5d0240b58643",
"name": "title",
"type": "string",
"value": "={{ $json['Title 1'] || $json['Titolo 1'] || $json['Titolo 1'] || $json['Título 1'] || $json['Titel 1'] }}"
},
{
"id": "abea81db-ce3b-4ac1-bd21-09ccfffb567a",
"name": "description",
"type": "string",
"value": "={{ $json['Meta Description 1'] || $json['Meta description 1'] }}"
},
{
"id": "2ca75d74-70f8-400b-b862-9da186135915",
"name": "statut",
"type": "string",
"value": "={{ $json['Status Code'] || $json['Code HTTP'] || $json['Status-Code'] || $json['Código de respuesta'] || $json['Codice di stato']}}"
},
{
"id": "754d3202-38b0-4d79-ba24-8078b3244307",
"name": "indexability",
"type": "string",
"value": "={{ $json.Indexability || $json.Indexabilité || $json.Indicizzabilità || $json.Indexabilidad || $json.Indexierbarkeit}}"
},
{
"id": "8bc6583d-bb34-4d22-b310-fe79bb8ac85d",
"name": "content_type",
"type": "string",
"value": "={{ $json['Content Type'] || $json['Type de contenu'] || $json['Tipo di contenuto'] || $json['Tipo de contenido'] || $json['Inhaltstyp']}}"
},
{
"id": "c874ba1a-769e-43d3-9555-8c9914ca9b76",
"name": "word_count",
"type": "string",
"value": "={{ $json['Word Count'] || $json['Nombre de mots'] || $json['Conteggio delle parole'] || $json['Conteggio delle parole'] || $json['Recuento de palabras'] || $json['Wortanzahl'] }}"
}
]
}
},
"typeVersion": 3.4
},
{
"id": "1a9af7a0-d2d5-44cb-9770-2d5a1e5706f4",
"name": "Text Classifier",
"type": "@n8n/n8n-nodes-langchain.textClassifier",
"disabled": true,
"position": [
2260,
60
],
"parameters": {
"options": {},
"inputText": "=url : {{ $json.url }}
title : {{ $json.title }}
description : {{ $json.description }}
words count : {{ $json.word_count }}",
"categories": {
"categories": [
{
"category": "useful_content",
"description": "Pages that are likely to contain high-quality content, making them suitable for inclusion in a file that aids content discovery for an LLM. "
},
{
"category": "other_content",
"description": "Pages that should not be included (e.g., pagination, or low-value content)."
}
]
}
},
"typeVersion": 1
},
{
"id": "74a4e378-4228-4142-92ca-e541efde2b15",
"name": "OpenAI Chat Model",
"type": "@n8n/n8n-nodes-langchain.lmChatOpenAi",
"position": [
2180,
240
],
"parameters": {
"model": {
"__rl": true,
"mode": "list",
"value": "gpt-4o-mini"
},
"options": {}
},
"credentials": {
"openAiApi": {
"id": "",
"name": "OpenAi Connection"
}
},
"typeVersion": 1.2
},
{
"id": "63dc6cfe-bc73-43b5-8c7d-4f5fd6501d3b",
"name": "No Operation, do nothing",
"type": "n8n-nodes-base.noOp",
"position": [
2580,
200
],
"parameters": {},
"typeVersion": 1
},
{
"id": "cb555b99-9e63-4b6b-a1fc-512b5467d666",
"name": "Sticky Note3",
"type": "n8n-nodes-base.stickyNote",
"position": [
1620,
-500
],
"parameters": {
"color": 7,
"width": 360,
"height": 860,
"content": "## Filter URLs
This **filter node** is used to keep only the URLs that meet the following conditions:
- `status` = **200**
- `indexability` = **indexable**
- `content_type` contains **text/html**
These filters are even **more useful** if the uploaded file is an **internal_all.csv** instead of an **internal_html.csv**.
### **Tips:**
You can **add more filters** to refine the URLs included in your `llms.txt` file.
💡 **Examples:**
- **Filter by word count** → Ensure pages contain **enough text content**.
- **Filter by URL path** → Keep only **specific folders or categories** in the `llms.txt` file.
- **Filter by meta description** → Exclude URLs **without a meta description**, as this field will be used in the `llms.txt` file to describe each piece of content.
"
},
"typeVersion": 1
},
{
"id": "e34e56e2-5cc8-4e50-bfb0-3aa2e5e04abf",
"name": "Filter URLs",
"type": "n8n-nodes-base.filter",
"position": [
1740,
60
],
"parameters": {
"options": {},
"conditions": {
"options": {
"version": 2,
"leftValue": "",
"caseSensitive": true,
"typeValidation": "strict"
},
"combinator": "and",
"conditions": [
{
"id": "cef4feaa-1c46-45b1-92b7-f5c2051b1dc5",
"operator": {
"type": "number",
"operation": "equals"
},
"leftValue": "={{ Number($json.statut) }}",
"rightValue": 200
},
{
"id": "bb821656-9740-4da4-8aa9-f65ad098c470",
"operator": {
"type": "boolean",
"operation": "true",
"singleValue": true
},
"leftValue": "={{ [\"Indexable\", \"Indicizzabile\", \"Indexierbar\"].includes($json.indexability) }}",
"rightValue": "={{ \"Indexable\" || \"Indicizzabile\" }}"
},
{
"id": "5c93ddb8-8091-406a-bc04-fa14e8b73fb9",
"operator": {
"type": "string",
"operation": "contains"
},
"leftValue": "={{ $json.content_type }}",
"rightValue": "text/html"
}
]
}
},
"typeVersion": 2.2
},
{
"id": "b98f19a8-afd3-4d26-8063-dee3ee75055f",
"name": "Sticky Note4",
"type": "n8n-nodes-base.stickyNote",
"position": [
2040,
-800
],
"parameters": {
"color": 2,
"width": 740,
"height": 1160,
"content": "## Text Classifier
🚫 **This node is deactivated by default** in the template.
You can **enable it** if you want to add a more **\"intelligent\" 🤓 filter** to refine the URLs included in the `llms.txt` file, helping LLMs discover and prioritize valuable content.
### How It Works:
This node has **two outputs**:
- **`useful_content`** → Pages that are **likely to contain high-quality content**, making them suitable for inclusion in a file that **aids content discovery for an LLM**.
- **`other_content`** → Pages that should **not** be included (e.g., pagination or low-value content).
You can **modify the description** in the node to fine-tune the classification according to your needs.
### Input Fields:
- **url** → `{{ $json.url }}`
- **title** → `{{ $json.title }}`
- **description** → `{{ $json.description }}`
- **word_count** → `{{ $json.word_count }}`
### Why use an LLM?
A **language model (LLM)** can **analyze** the **URL, title, and description** to identify pages that **most likely contain meaningful and relevant content**.
This allows it to **prioritize valuable pages** and structure the data for **better content discovery and training purposes**.
### **For large websites**
If you have a **very large website**, consider using a **Loop Over Items** node to make the workflow **more robust** and ensure all pages are processed.
Also, using a **Loop Over Items** node makes it **easier** to handle:
- **Timeouts**
- **API quotas**
- **Other scalability issues**
### Tokens usage
Finally, keep in mind that **more pages mean more tokens and more billed LLM API calls**.
"
},
"typeVersion": 1
},
{
"id": "63e3ea7a-cec3-442c-9812-771def0a9949",
"name": "Sticky Note5",
"type": "n8n-nodes-base.stickyNote",
"position": [
2840,
-500
],
"parameters": {
"color": 7,
"width": 360,
"height": 860,
"content": "## Set Field - llms.txt Row
This node **sets** the row format for the `llms.txt` file.
### Row Structure:
Each row follows this format:
- `- [title](link): description`
If the URL **has no description** (from the **Meta Description** in the Screaming Frog export), the row will be:
- `- [title](link)`
"
},
"typeVersion": 1
},
{
"id": "78f58220-feb5-4044-b994-39a0e4f1e9e4",
"name": "Sticky Note6",
"type": "n8n-nodes-base.stickyNote",
"position": [
3260,
-500
],
"parameters": {
"color": 7,
"width": 360,
"height": 860,
"content": "## Summarize - Concatenate
This node concatenates all the output from the previous node, ensuring each row is on a separate line."
},
"typeVersion": 1
},
{
"id": "7a119633-7cd3-4de5-a1cd-7f708e1abf4a",
"name": "Sticky Note7",
"type": "n8n-nodes-base.stickyNote",
"position": [
3680,
-500
],
"parameters": {
"color": 7,
"width": 360,
"height": 860,
"content": "## Set Fields - llms.txt Content
This node sets the content of the `llms.txt` file using:
- The **website title** provided in the form (first node).
- The **website description** provided in the form (first node).
- The output from the previous node, which includes all the URLs, their titles, and their descriptions that will appear in the `llms.txt` file.
"
},
"typeVersion": 1
},
{
"id": "554f6858-68e8-4b35-a6c4-21bed6832323",
"name": "Sticky Note8",
"type": "n8n-nodes-base.stickyNote",
"position": [
4100,
-500
],
"parameters": {
"color": 7,
"width": 360,
"height": 860,
"content": "## Generate llms.txt file
This node **creates** the `llms.txt` file, which can be **downloaded directly** within n8n.
"
},
"typeVersion": 1
},
{
"id": "24bdefba-e2f2-41f0-93e7-9f8d2fc11f43",
"name": "Sticky Note9",
"type": "n8n-nodes-base.stickyNote",
"position": [
4520,
-500
],
"parameters": {
"color": 7,
"width": 360,
"height": 860,
"content": "## upload file anywhere
Instead of downloading the file directly from the n8n workflow, you can **replace this node** with a Drive node (e.g., **Google Drive** or **OneDrive**) to upload the `llms.txt` file to a folder of your choice.
**Name the file properly** (e.g., include the website name) to make it easier to find and distinguish between files when working on multiple websites.
"
},
"typeVersion": 1
},
{
"id": "a3be51e3-810c-40a7-a996-98a3d383c2b9",
"name": "Summarize - Concatenate",
"type": "n8n-nodes-base.summarize",
"position": [
3380,
40
],
"parameters": {
"options": {},
"fieldsToSummarize": {
"values": [
{
"field": "llmTxtRow",
"separateBy": "
",
"aggregation": "concatenate"
}
]
}
},
"typeVersion": 1.1
},
{
"id": "8d3a892a-3d11-4d8a-8ec6-84f8f3af1183",
"name": "Set Fields - llms.txt Content",
"type": "n8n-nodes-base.set",
"position": [
3820,
40
],
"parameters": {
"options": {},
"assignments": {
"assignments": [
{
"id": "97062a99-e944-4e1e-89b1-62cf9e3462dd",
"name": "llmsTxtFile",
"type": "string",
"value": "=# {{ $('Form - Screaming frog internal_html.csv upload').item.json['What is the name of your website?'] }}
> {{ $('Form - Screaming frog internal_html.csv upload').item.json['Can you provide a short description of your website? (in the language of the website)'] }}
{{ $json.concatenated_llmTxtRow }}"
}
]
}
},
"typeVersion": 3.4
},
{
"id": "bc2a692a-47ea-4bf1-a102-e607fd544158",
"name": "upload file anywhere",
"type": "n8n-nodes-base.noOp",
"position": [
4640,
40
],
"parameters": {},
"typeVersion": 1
},
{
"id": "404510a2-35b2-44cf-9d02-eb0abcf4e9b3",
"name": "Set Field - llms.txt Row",
"type": "n8n-nodes-base.set",
"position": [
2960,
40
],
"parameters": {
"options": {},
"assignments": {
"assignments": [
{
"id": "95e75caa-8110-476b-9cb1-73c15361fa56",
"name": "llmTxtRow",
"type": "string",
"value": "=- [{{ $json.title }}]({{ $json.url }}){{ $json.description ? ': ' + $json.description : '' }}"
}
]
}
},
"typeVersion": 3.4
},
{
"id": "f54d51f2-17bc-4c58-b177-0e77e16f7b72",
"name": "Sticky Note10",
"type": "n8n-nodes-base.stickyNote",
"position": [
-420,
-1020
],
"parameters": {
"color": 5,
"width": 700,
"height": 1380,
"content": "# Generate AI-Ready llms.txt Files from Screaming Frog Website Crawls
This workflow helps you generate an **llms.txt** file (if you're unfamiliar with it, check out [this article](https://towardsdatascience.com/llms-txt-414d5121bcb3/)) using a **Screaming Frog export**.
[Screaming Frog](https://www.screamingfrog.co.uk/seo-spider/) is a well-known website crawler.
You can easily crawl a website. Then, export the **\"internal_html\"** section in CSV format.
## How It Works:
A **form** allows you to enter:
- The **name of the website**
- A **short description**
- The **internal_html.csv** file from your Screaming Frog export
Once the form is submitted, the **workflow is triggered automatically**, and you can **download the llms.txt file directly from n8n**.
## Downloading the File
Since the last node in this workflow is **\"Convert to File\"**, you will need to **download the file directly from the n8n UI**.
However, you can easily **add a node** (e.g., Google Drive, OneDrive) to automatically upload the file **wherever you want**.
## AI-Powered Filtering (Optional):
This workflow includes a **text classifier node**, which is **deactivated by default**.
- You can **activate it** to apply a more **intelligent filter** to select URLs for the `llms.txt` file.
- Consider modifying the **description** in the classifier node to specify the type of URLs you want to include.
## How to Use This Workflow
1. **Crawl the website** you want to generate an `llms.txt` file for using **Screaming Frog**.
2. **Export the \"internal_html\"** section in CSV format.

3. In **n8n**, click **\"Test Workflow\"**, fill in the form, and **upload** the `internal_html.csv` file.
4. Once the workflow is complete, go to the **\"Generate llms.txt file\"** node and **download the output**.
**That's it! You now have your llms.txt file!**
**Recommended Usage:**
Use this workflow **directly in the n8n UI by clicking** 'Test Workflow' and uploading the file in the form."
},
"typeVersion": 1
},
{
"id": "e33104af-802a-43f2-b26d-f368f7de2fd7",
"name": "Form - Screaming frog internal_html.csv upload",
"type": "n8n-nodes-base.formTrigger",
"position": [
460,
60
],
"webhookId": "8791f39a-3d81-405c-b177-0a733ebf74cb",
"parameters": {
"options": {
"buttonLabel": "Get the llms.txt file"
},
"formTitle": "llms.txt Generator - From Screaming Frog export",
"formFields": {
"values": [
{
"fieldLabel": "What is the name of your website?",
"placeholder": "Example : The best website ever",
"requiredField": true
},
{
"fieldLabel": "Can you provide a short description of your website? (in the language of the website)",
"placeholder": "Example : This is the best website ever because all the content is engaging and valuable.",
"requiredField": true
},
{
"fieldType": "file",
"fieldLabel": "screaming_frog_export",
"multipleFiles": false,
"requiredField": true,
"acceptFileTypes": ".csv"
}
]
},
"responseMode": "lastNode",
"formDescription": "Generate a simple llms.txt file from a Screaming Frog Export
It is recommended to use the internal_html.csv export, although internal_all.csv will also work.
Just fill in the fields in this form 😄"
},
"typeVersion": 2.2
},
{
"id": "f6b17fdd-a098-411e-8d53-3f6e638cc3ba",
"name": "Extract data from Screaming Frog file",
"type": "n8n-nodes-base.extractFromFile",
"position": [
900,
60
],
"parameters": {
"options": {},
"operation": "xls",
"binaryPropertyName": "screaming_frog_export"
},
"typeVersion": 1
},
{
"id": "6bbd8d1f-3322-4c6d-af08-c842386239ce",
"name": "Generate llms.txt file",
"type": "n8n-nodes-base.convertToFile",
"position": [
4220,
40
],
"parameters": {
"options": {
"encoding": "utf8",
"fileName": "llms.txt"
},
"operation": "toText",
"sourceProperty": "llmsTxtFile"
},
"typeVersion": 1.1
}
],
"active": false,
"pinData": {},
"settings": {
"executionOrder": "v1"
},
"versionId": "",
"connections": {
"Filter URLs": {
"main": [
[
{
"node": "Text Classifier",
"type": "main",
"index": 0
}
]
]
},
"Text Classifier": {
"main": [
[
{
"node": "Set Field - llms.txt Row",
"type": "main",
"index": 0
}
],
[
{
"node": "No Operation, do nothing",
"type": "main",
"index": 0
}
]
]
},
"OpenAI Chat Model": {
"ai_languageModel": [
[
{
"node": "Text Classifier",
"type": "ai_languageModel",
"index": 0
}
]
]
},
"Set useful fields": {
"main": [
[
{
"node": "Filter URLs",
"type": "main",
"index": 0
}
]
]
},
"Generate llms.txt file": {
"main": [
[]
]
},
"Summarize - Concatenate": {
"main": [
[
{
"node": "Set Fields - llms.txt Content",
"type": "main",
"index": 0
}
]
]
},
"Set Field - llms.txt Row": {
"main": [
[
{
"node": "Summarize - Concatenate",
"type": "main",
"index": 0
}
]
]
},
"Set Fields - llms.txt Content": {
"main": [
[
{
"node": "Generate llms.txt file",
"type": "main",
"index": 0
}
]
]
},
"Extract data from Screaming Frog file": {
"main": [
[
{
"node": "Set useful fields",
"type": "main",
"index": 0
}
]
]
},
"Form - Screaming frog internal_html.csv upload": {
"main": [
[
{
"node": "Extract data from Screaming Frog file",
"type": "main",
"index": 0
}
]
]
}
}
}
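To make the data flow concrete without importing the JSON, here is a minimal Python sketch of the same pipeline: parse the export, apply the same conditions as the "Filter URLs" node, format each row like the "Set Field - llms.txt Row" node, and write the file. It assumes the English column names of an internal_html.csv export; the paths and site metadata are hypothetical.

```python
import csv

# Hypothetical paths; adjust to your own export location.
EXPORT_PATH = "internal_html.csv"
OUTPUT_PATH = "llms.txt"

SITE_NAME = "The Best Website Ever"        # form field: website name
SITE_DESCRIPTION = "A short description."  # form field: website description

rows = []
with open(EXPORT_PATH, newline="", encoding="utf-8") as f:
    for page in csv.DictReader(f):
        # Mirrors the "Filter URLs" node: HTTP 200, indexable, HTML only.
        if page.get("Status Code") != "200":
            continue
        if page.get("Indexability") != "Indexable":
            continue
        if "text/html" not in page.get("Content Type", ""):
            continue
        # Mirrors "Set Field - llms.txt Row":
        # "- [title](url): description", suffix omitted when empty.
        title = page.get("Title 1", "")
        url = page.get("Address", "")
        description = page.get("Meta Description 1", "")
        suffix = f": {description}" if description else ""
        rows.append(f"- [{title}]({url}){suffix}")

# Mirrors "Set Fields - llms.txt Content" and "Generate llms.txt file".
content = f"# {SITE_NAME}\n> {SITE_DESCRIPTION}\n" + "\n".join(rows)
with open(OUTPUT_PATH, "w", encoding="utf-8") as f:
    f.write(content)
```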
Features
- Form-based trigger collecting the website name, a short description, and the Screaming Frog CSV export
- Multi-language support for Screaming Frog column names (English, French, Italian, German, Spanish)
- Filtering that keeps only indexable, HTTP 200, text/html URLs
- Optional AI classification (gpt-4o-mini) to retain only high-value content pages
- A downloadable llms.txt file, with an easy swap to a Drive upload node
Technical Analysis
Node types and roles
- n8n-nodes-base.formTrigger — collects the website name, description, and CSV upload
- n8n-nodes-base.extractFromFile — parses the uploaded Screaming Frog export
- n8n-nodes-base.set — normalizes columns into fields, builds each llms.txt row, and assembles the final content
- n8n-nodes-base.filter — keeps only indexable, HTTP 200, text/html URLs
- @n8n/n8n-nodes-langchain.textClassifier — optional AI filter for high-value pages (disabled by default)
- @n8n/n8n-nodes-langchain.lmChatOpenAi — the chat model backing the classifier
- n8n-nodes-base.summarize — concatenates the rows, one per line
- n8n-nodes-base.convertToFile — produces the downloadable llms.txt file
- n8n-nodes-base.noOp — placeholders ("upload file anywhere" and the discarded classifier output)
- n8n-nodes-base.stickyNote — in-canvas documentation
Implementation Guide
Prerequisites
- Access to an n8n instance
- Screaming Frog SEO Spider, to crawl the website and export internal_html.csv
- OpenAI API credentials (only if you enable the optional Text Classifier node)
Setup steps
- Import the workflow JSON file into n8n
- (Optional) Attach your OpenAI credentials and enable the Text Classifier node
- Click "Test Workflow", fill in the form, and upload the internal_html.csv export
- Download the generated file from the "Generate llms.txt file" node
- (Optional) Replace the "upload file anywhere" placeholder with a Google Drive or OneDrive node; a file-naming sketch follows this list
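If you add an upload node, naming the file per website makes multi-site runs easier to manage, as the canvas sticky note suggests. A sketch of what the fileName parameter of the "Generate llms.txt file" node could be set to, reusing the form field already defined in this workflow; the llms_ prefix is an arbitrary choice:

```
=llms_{{ $('Form - Screaming frog internal_html.csv upload').item.json['What is the name of your website?'] }}.txt
```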
Key Parameters
| Parameter | Node | Value | Notes |
|---|---|---|---|
| model | OpenAI Chat Model | gpt-4o-mini | Chat model used by the optional classifier |
| status condition | Filter URLs | 200 | Only URLs returning HTTP 200 are kept |
| indexability condition | Filter URLs | Indexable | Localized values (Indicizzabile, Indexierbar) are also accepted |
| content_type condition | Filter URLs | text/html | Excludes images, PDFs, and other non-HTML resources |
| fileName / encoding | Generate llms.txt file | llms.txt / utf8 | Name and encoding of the output file |
Best Practices
Optimization tips
- Add a word-count filter so only pages with substantial text content are included (see the condition sketch after this list)
- Filter by URL path to limit the llms.txt file to specific folders or categories
- Exclude URLs without a meta description, since that field becomes each row's description
- For very large websites, wrap the classifier in a Loop Over Items node to handle timeouts and API quotas
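A sketch of an extra entry for the conditions array of the "Filter URLs" node that implements the word-count suggestion, assuming n8n's gte number operation; the 300-word threshold is an arbitrary example and the id is a placeholder:

```json
{
  "id": "REPLACE-WITH-GENERATED-ID",
  "operator": { "type": "number", "operation": "gte" },
  "leftValue": "={{ Number($json.word_count) }}",
  "rightValue": 300
}
```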
Security notes
- Store the OpenAI API key in n8n credentials rather than hard-coding it in nodes
- Restrict access to the workflow: the form trigger exposes a webhook URL that anyone with the link can submit to
- Review execution logs periodically if the workflow handles exports from client websites
Performance notes
- The Text Classifier issues one LLM call per URL, so more pages mean more tokens and higher API cost
- Keep the classifier disabled when the basic status/indexability/content-type filters are sufficient
- For large crawls, a Loop Over Items node makes timeouts and API quota errors easier to isolate and retry
Troubleshooting
Common issues
The workflow fails after the extraction step
The uploaded file is probably not the expected Screaming Frog export: the workflow still proceeds, but downstream nodes fail on missing columns (Address, Title 1, Status Code, and so on). A quick header check is sketched below.
The llms.txt file is empty or missing pages
Check the "Filter URLs" node: rows are dropped unless the status is 200, indexability is "Indexable", and the content type contains text/html. An internal_all.csv export contains many non-HTML rows that are filtered out by design.
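Before uploading, you can verify that the export carries the columns the "Set useful fields" node expects. A minimal sketch for the English export, assuming the file sits next to the script (localized exports use the translated column names listed in the workflow):

```python
import csv

# English column names expected by the "Set useful fields" node.
EXPECTED = {"Address", "Title 1", "Meta Description 1", "Status Code",
            "Indexability", "Content Type", "Word Count"}

with open("internal_html.csv", newline="", encoding="utf-8") as f:  # hypothetical path
    headers = set(next(csv.reader(f)))

missing = EXPECTED - headers
print("Missing columns:", missing if missing else "none — file looks usable")
```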
Debugging tips
- Run the workflow with "Test Workflow" and inspect each node's output in the n8n UI
- Check the "Set useful fields" output first: empty fields usually mean the export's column names are not among the supported language variants
- If the Text Classifier is enabled, verify the OpenAI credentials and review its classification of a few sample URLs
- Step through the workflow node by node to locate the failing step
Error handling
The template ships with no dedicated error-handling branches: if the uploaded file is not a valid Screaming Frog export, downstream nodes simply fail on missing fields. For production use, consider adding an error workflow, retry settings on the OpenAI node, and a Loop Over Items node for large crawls.